_Name:_Jacob Hanimann
Load all the libraries that you use in the rest of the analyses.
Answer:
library(tidyverse)
library(pheatmap)
library(viridis)
library(ggrepel)
library(DESeq2)
Antisense transcription is a current hot topic in genomics: when a certain genomic region is transcribed on both strands. In one model, the two RNAs produced may hybridize with each other since they are complementray, and then be degraded (a type of negative regulators). In another, deregulation happens becasue of RNA polymerase collisions.
We will now try to answer the question : How common is this in the human genome?
Use the refseq track from the hg19 assembly.
Using R and bedtools (and possibly a text editor to add headers to bed files if needed), find out:
Answer:
First, I downloaded two tables from the table browser in genome browser selecting: the hg19 genome, group = “gene and gene prediction”, track = “NCBI Refseq” and table = “RefSeq All”. By using the filter option, i created one table with only “+” strand transcripts and one with only “-” strand transcripts. Both tables were downloaded in the BED format with 12 columns (organised in the bedtool format).
Next, i used UNIX to to create a new table with an additional column displaying how many overlaps each “+” strand transcript has with “-” strand transcripts (covering at least 20% of the transcript) using the following command:
“betools intersect -a hg19_refseq_plus_txt -b hg19_refseq_minus.txt -c -f 0.2 > plus_minus_intersect_count.txt”
Here as a picture:
Note, that slightly different results would be calculated if you switched the order of “+” and “-” strand in the command line, since there is a threshold of 20%. In the following analysis, I will stick to the case of “+” strand as the reference because both ways show enough similar results when just looking at the fraction that overlap at least one opposite transcript.
First, I read in the data and deleted the columns i do not need and renamed the rows. Then, I calculated the ratio between transcript with overlaps 0< over all “+” strand transcripts through counting rows.
#read in downloaded data from UNIX
#intersect
plus_strand_intersect_count = as_tibble(read_tsv("plus_minus_intersect_count.txt", col_names = FALSE))
#plus strand (just needed for counting total transcript)
plus_strand = as_tibble(read_tsv("hg19_refseq_plus.txt", col_names = FALSE))
#minus strand (just needed for counting total transcript)
minus_strand = as_tibble(read_tsv("hg19_refseq_minus.txt", col_names = FALSE))
#deleting unneccessary columns:
plus_strand_intersect_count[5:12] <- list(NULL)
#naming columns
colnames(plus_strand_intersect_count) <- c("chrom","start","end","name","overlaps")
#calculate ratio between transcript with overlaps 0< over all "+" strand transcripts through counting rows
plus_strand_intersect_count %>% filter(overlaps>0) %>% summarise(n()) / summarise(plus_strand_intersect_count, n())
| n() |
|---|
| 0.0770338 |
#values of the calculation = 3050/39593 = 0.07703382
Apparently, approximately 7.7% of all “+” strand transcript have at least one overlap with a “-” transcript, covering at least 20% of the transcript.
The fraction of all Refseq transcripts overlapping opposite strand transcripts (“+” strands with at least one overlap) / (total number of “+” and “-” transcripts) is:
plus_strand_intersect_count %>% filter(overlaps>0) %>% summarise(n()) / (nrow(plus_strand)+nrow(minus_strand))
| n() |
|---|
| 0.039198 |
3050 of 77810 transcripts, respectively about 3.9 %, have at least one overlap with the opposite strand (calculated with “-” aligned to “+” strands with a threshold of 20%)
Answer:
Because you could get significantly different results depending of which strand you align to which, I calculated the opposite case of “-” strand as a reference followingly with the same pipeline as before:
UNIX command: “bedtools intersect -a hg19_refseq_minus.txt -b hg19_refseq_plus.txt -c -f 0.20 > minus_plus_intersect_count.txt”
I read the new data in and created one large tibble with both datasets combined and annotated each transcript with a column called strand with the value of either “plus” or “minus”.
#read in downloaded data from UNIX "-" as a reference
#intersect
minus_strand_intersect_count = as_tibble(read_tsv("minus_plus_intersect_count.txt", col_names = FALSE))
#deleting unneccessary columns:
minus_strand_intersect_count[5:12] <- list(NULL)
#annotating data with strand orientation
plus_strand_intersect_count = cbind(plus_strand_intersect_count, matrix(c("plus"), nrow = nrow(plus_strand_intersect_count), ncol=1))
minus_strand_intersect_count = cbind(minus_strand_intersect_count, matrix(c("minus"), nrow = nrow(minus_strand_intersect_count), ncol=1))
#naming columns
colnames(minus_strand_intersect_count) <- c("chrom","start","end","name","overlaps", "strand")
colnames(plus_strand_intersect_count) <- c("chrom","start","end","name","overlaps", "strand")
#binding datasets
total_intersect_count = rbind(plus_strand_intersect_count,minus_strand_intersect_count)
I plotted two histogram of overlap counts on the x-axis and counts of the transcript on the y-axis, starting at 1 overlap. I splitted the plots by strand annotation plus and minus (indicating alignment order).
#plotting a histogram of overlaps as the x-axis and counts on the y-axis starting at 1 overlap and a binwidth of 1
total_intersect_count %>% filter(overlaps >0) %>% ggplot(aes(x=overlaps, fill= strand)) + geom_histogram(binwidth = 1) + theme_bw() + facet_grid(~strand, scales = "free")
Looking at the Histogramm plots, it is apparent that most transcript have an overlap of one transcript of the opposite strand and the distribution looks like a exponential decay graph. There are also a few outliners from overlap count 20 to 41 (minus strand) to 81 (plus strand).
Answer:
I arranged the total_intersect_count tibble in a top-down manner with the criteria of overlaps.
total_intersect_count %>% arrange(desc(overlaps)) %>% head(5)
| chrom | start | end | name | overlaps | strand |
|---|---|---|---|---|---|
| chr18 | 34854422 | 34856363 | NR_134588.1 | 81 | plus |
| chrY | 15591056 | 15591146 | NR_162134.1 | 77 | plus |
| chr8 | 53106921 | 53112112 | NR_134310.1 | 54 | plus |
| chr10 | 31650355 | 31688684 | NM_001368169.1 | 41 | minus |
| chr19 | 57323847 | 57325161 | NR_023847.2 | 39 | plus |
Apparently, the transcript with the most overlapping transcript (strand=plus) with a count of 81 is labeled as NR_134588.1, is located on chromosome 18 and spans from position 34854422 to 34856363.
NR_134588.1 is annotated in RefSeq as the exons of the uncharacterized gene LOC105372068 which is classified as a non-coding RNA. On the opposite strand of this locus is the CELF4 gene, which is described as being part of the protein family that regulate pre-mRNA alternative splicing. Evidently, 81 transcript variants encoding different isoforms have been found for this gene.
It is apparent that the CELF4 gene is significantly longer than LOC105372068.
Based on this finding, it would make more sense to count a gene as one overlap and not each transcript of it as a individual count. How many isoforms of a gene exists, does not determine how much it will be transcripted and potentially interfere with the transcript of the opposite strand. Also, this approach to investigate antisense transcription is rather unspecific and vague. I would suggest to analyze tissue specific, since not all genes are expressed in all cell type. Furthermore, the timing of the transcription is also crucial to the potential of interfering with the antisense strand.
The RRP40 gene is part of the exosome complex, a molecular machine that degrades RNAs in the cells from the 3’ end. Your collaborator has just made a CAGE experiment in cells in which RRP40 was depleted using a RRP40-specific siRNA. For comparison,he/she also made a control experiment where a random siRNA was used. The hope is to be able to identify what RNAs that are degraded by RRP40, becasue we should observe higher levels of these if RRP40 is depleted. They have a hypothesis that there may be RNAs transcription initiation close to known gene transcription start sites (TSSs) that we never observe in normal cells becasue the RNAs are eaten up so fast that.
The CAGE reads are already mapped to the genome and another collaborator has already made some files for you that shows where they fall around annotated TSSs. Specifically, you are given 4 files where the rows are the -600 to +400 region around ~12.000 annotated TSSs, and the columns are the positions (so, the first is position -600 etc for respective TSS). Values are TPM-normalized CAGE counts plus a small amount of artificially added noise to avoid model overfitting, aside from the first column which is just the genomic position we are looking at, eg chr4:10000-11000+.
Because you have two experiments and two strands, you have four files in total. For example, the Hw4_CAGE_rrp40_senseStrand file has CAGE data on the plus strand and from the RRP40 depletion experiment. “Strand” is here always relative to the annoatated TSS, which is always defined to be on the plus strand.
These files are quite big (around 12 million data points, or 60-70 megabyte each), and your collaborators belatedly realized they could not plot these using Excel. Panic ensued. This is why they hired you: you know how to use R and your job is now to analyze the data and visualizing the results.
Specifically, they want you to:
Using tidyverse (except when reading in files, see below), make a plot where the Y axis is average fold change (rrp40/ctrl) and X axis is position relative to TSS (-600 to +400). Calculate fold changes for each strand so in the end, you will have a plot with two fold change ‘lines’, one for each strand. Interpret the results: What are we seeing and does this agree with the text-book decription of promoters and transcription start sites? ( max 100 words)
Answer:
After reading the data in, I created two matrices which calculated the fold change of the expression and then the mean of each position relative to the TSS. Then, I annotated the two matrices with the position and strand information (sense/antisense), renamed the column so that I could merge them. In the last step i plotted the data with a color code for thre strand feature.
#read in data
sense_control = as_tibble(read_tsv("Hw4_CAGE_Ctrl_senseStrand.txt"))
antisense_control = as_tibble(read_tsv("Hw4_CAGE_Ctrl_antisenseStrand.txt"))
sense_rrp40 = as_tibble(read_tsv("Hw4_CAGE_rrp40_senseStrand.txt"))
antisense_rrp40 = as_tibble(read_tsv("Hw4_CAGE_rrp40_antisenseStrand.txt"))
#create two new matrix with (rrp40/ctr) to calculate fold change and then the average of each position
average_fc_sense = as_tibble(colMeans((sense_rrp40[,-1]/sense_control[,-1])))
average_fc_antisense = as_tibble(colMeans((antisense_rrp40[,-1]/antisense_control[,-1])))
#create two matrixes for the data with position (1:1000) and strand information (sense\antisense) and then bind it to the average fold change value of the position
#sense
matrix_sense = cbind(matrix(c("sense"), nrow = 1000, ncol=1),matrix(c(1:1000),
nrow = 1000, ncol=1),average_fc_sense)
#antisense
matrix_antisense = cbind(matrix(c("antisense"), nrow = 1000, ncol=1), matrix(c(1:1000), nrow = 1000, ncol=1), average_fc_antisense)
#renaming the column names
colnames(matrix_sense) <- c("strand", "position", "foldchange")
colnames(matrix_antisense) <- c("strand", "position", "foldchange")
#merging both strand data sets to get one tibble
both_strands <- as_tibble(rbind(matrix_sense, matrix_antisense))
#plotting data
both_strands %>% ggplot(aes(x= position, y= foldchange, col= strand))+ geom_point(alpha=0.5) +theme_bw()
According to this plot, there was a notably bigger increase in transcription of antisense RNA compared to the sense RNA when the siRNA depleted the RRP40 gene. Sense RNA transcription was also upregulated, but not that aberrant. This data suggests that the target of the exosome complex component RRP40 is anti-sense RNA nearby and upstream of the TSS. Also, the exosome complex is described (https://www.uniprot.org/uniprot/Q08285) as participating in the elimination of RNA-processing by-products non-coding ‘pervasive’ transcripts, including antisense RNA-species. This annotation matches these observation of significantly more antisense-transcripts near TSS in the RRP40 mutant.
You have been hired by the Danish pharmaceutical giant Novo Nordisk to analyze an RNA-Seq study they have recently conducted. The study involves treatment of pancreatic islet cells with a new experimental drug for treatment of type 2 diabetes. Novo Nordisk wants to investigate how the drug affects cellular mRNA levels in general, and whether the expression of key groups of genes are affected.
As the patent for the new experimental drug is still pending, Novo Nordisk has censored the names of genes.
You have been supplied with 4 files:
studyDesign.tsv: File describing treatment of the 18 samples included in the study.countMatrix.tsv: Number of RNA-Seq reads mapping to each of the genes.normalizedMatrix.tsv: Normalized expression to each of the genes.diabetesGene.tsv: Collection of genes known to be involved in type 2 diabetes.Question 3.1.1: Read all dataset into R, and make sure all three files have matching numbers and names of both samples and genes.
Answer:
count_matrix = read.table(file = 'countMatrix.tsv', sep = '\t')
diabetes_genes = read_tsv("diabetesGenes.tsv")
normalizedMatrix = read.table(file = 'normalizedMatrix.tsv', sep = '\t')
studyDesign = read_tsv("studyDesign.tsv")
Next, we want to see if the data makes sense, by making a heat map and a PCA plot.
Question 3.1.2: Heat map: For heat maps, it makes no sense to include all genes - instead, we will only look at genes that vary substantially across the samples. Specifically, select the genes top 10% of genes based on their variance across all samples, and make a heat map of those using the pheatmap library (standard settings). Rows in the heat mpa should be genes, columns shoudl be samples. Make an annotation row that shows whether each sample is treatment or control. Comment on your plot
Answer:
First, I calculated the variance score for each gene and added it as a column to the normalizedMatrix. Then i created a annotation matrix to later use in the heatmap to differ between treated and control samples. Lastly, i selected the top 10% of the highest variation score and plotted it as a heatmap.
#calculating the variance of each row
variance = normalizedMatrix %>% apply(., MARGIN=1, FUN=var) %>% as_tibble
#adding the variance for each gene to the normalizedMatrix
nmatrix_var = cbind(normalizedMatrix,variance)
#annotation sample or treatment for heatmap
col<-data.frame(studyDesign[,-1])
row.names(col)<- studyDesign$Sample
#selecting the top 10% genes with the highest variance and plotting it with heatmap
nmatrix_var %>% arrange(desc(value))%>% top_n(.,(round(0.1*nrow(.)))) %>% select(-value)%>% pheatmap(.,color = magma(10),annotation_col =col)
The clustering of the expression patter divides the samples in two groups which matches the criteria of control and treated. There are two exception to this pattern which are Sample 6 (control) and Sample 18 (treated). It looks like the annotation of these samples were switched.
Question 3.1.3: PCA: Using the normalized matrix (all genes, not the top 10% of genes as in the heat map), perform a Principal Components Analysis (PCA) on the samples and produce a PCA-plot of the two first components, where the axis labels show the amount of variance explained by each component and samples are colored by their experimental group. Find a way to label the samples, so the identity (the sample name) of each point can easily be seen (hint: look at geom_text() or the ggrepel package!). Note, you should center but not scale the data. Comment on your plot
Answer:
I performed a pca by transposing the matrix so that the genes function as dimensions.
#perform pca on normallizedMatrix and looking at summary
pca_exp = normalizedMatrix %>% t() %>% prcomp(.,center=TRUE, )
#extracting variance
percent_variance_exp <- summary(pca_exp)$importance["Proportion of Variance",] * 100
#annotate data with sample identity and then plot
as_tibble(pca_exp$x) %>% bind_cols(studyDesign) %>%
ggplot(.,aes(y= PC2, x=PC1, col=Condition, label= Sample))+ ggrepel::geom_text_repel() + geom_point()+ theme_bw() + xlab(label = paste("PC1", percent_variance_exp[1])) + ylab(label = paste("PC2", percent_variance_exp[2]))
The PCA shows like the heatmap a distinct clustering between control and treated samples. Again, Sample 6 and Sample 18 are located in the opposite group, which support the hypothesis of these two samples being wrongly annotated (switched).
I did not observe any grouping in the PC2/PC3 plot which I could make sense of.
Question 3.1.4: Based on the two previous questions, discuss (max 50 words) whether your observations indicate that there are any problems with the data - e.g. outliers, mix ups, sub-groups. If you identified problems try to fix them (e.g. remove clear outliers if you find them, fix mix-ups, etc ). If you make a correction, make a PCA plot with your corrected data to check that the correction is doing the right thing
Answer:
Like metioned twice before, I suggest that the classification of Sample 6 (Ctrl) and Sample 18 (Trt) were mixed up and that they are in fact part of the other group.
I corrected the annotation by creating corrected version of the studyDesign data frame:
#corr stands for corrected
corr_studyDesign = studyDesign
corr_studyDesign[6,2] = "Trt"
corr_studyDesign[18,2] = "Ctrl"
#plotting PCA with corr_studyDesign as annotation_matrix
as_tibble(pca_exp$x) %>% bind_cols(corr_studyDesign) %>%
ggplot(.,aes(y= PC2, x=PC1, col=Condition, label= Sample))+ ggrepel::geom_text_repel() +geom_point()+ theme_bw() + xlab(label = paste("PC1", percent_variance_exp[1])) + ylab(label = paste("PC2", percent_variance_exp[2]))
With the new annotation the samples are smoothly seperated by their classification sample/control.
Question 3.2.1: Use DESeq2 to obtain differentially expressed (DE) genes between the two experimental conditions. Use default parameter, except use a logFC threshold of 0.25 and an adjusted P-value threshold of 0.05. How many up-and down regulated genes are there on your corrected data compared to if you do the Deseq2 analysis on un-corrected data?
Answer:
I conducted an analysis with corrected (corr) and uncorrected data (uncorr). First, I saved the data as DESeqDatSet-object. Then, I ran the DESeq analysis and displayed the results with the wanted parameters with the result function and subsequently with the summary function of the library.
#save data as DESeqDatSet-object
#uncorrected
dds_uncorr <- DESeqDataSetFromMatrix(countData = count_matrix,
colData = studyDesign,
design = ~ Condition)
#corrected
dds_corr <- DESeqDataSetFromMatrix(countData = count_matrix,
colData = corr_studyDesign,
design = ~ Condition)
#run DESeq
dds_uncorr <- DESeq(dds_uncorr)
dds_corr <- DESeq(dds_corr)
#see results with adjusted parameters: contrast states in which order the DE should be calculated (Treatment]\Control)
#uncorrected
res_uncorr <- results(dds_uncorr,
contrast=c("Condition", "Trt", "Ctrl"),
lfcThreshold=0.25, # logFC cutoff, instead of 0
alpha=0.05) # adjusted FDR cutoff
#corrected
res_corr <- results(dds_corr,
contrast=c("Condition", "Trt", "Ctrl"),
lfcThreshold=0.25, # logFC cutoff, instead of 0
alpha=0.05) # adjusted FDR cutoff
#Summary of uncorrected Data set
summary(res_uncorr)
##
## out of 5669 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0.25 (up) : 160, 2.8%
## LFC < -0.25 (down) : 194, 3.4%
## outliers [1] : 0, 0%
## low counts [2] : 550, 9.7%
## (mean count < 2)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
#Summary of corrected Data set
summary(res_corr)
##
## out of 5669 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0.25 (up) : 653, 12%
## LFC < -0.25 (down) : 873, 15%
## outliers [1] : 0, 0%
## low counts [2] : 110, 1.9%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
In the uncorrected data the analyses states 160 (2.8%) upregulated and 194 (3.4%) downregulated genes, whereas the corrected data results in 653 (12%) upregulateda and 873 (15%) downregulated genes. This is significantly more and crucial for further analysis.
Question 3.2.2: From now on, we will only analyze the corrected data. Convert the output of DESeq2 (corrected data) to a tibble, and make an MA-plot using ggplot2. The MA-plot should show the overall trend using a trend line and genes should colored according to their DE status. Discuss whether the MA-plot indicates an approriate DESeq2 analysis (max 70 words discussion).
Answer:
#transfrom the DESeq data format to a tibble
res_corr_tibble <-res_corr %>%
as.data.frame %>%
rownames_to_column("Gene") %>%
as_tibble
#MA-plot from tibble
res_corr_tibble %>% ggplot(., aes(x=baseMean, y=log2FoldChange, col = ifelse(padj > 0.05 & padj==NA_character_,"",'significant'))) +
geom_point(alpha=0.5) + geom_smooth(col="blue") +
scale_x_log10() +
geom_hline(yintercept = 0, alpha = 0.75,
color="red")+ theme_bw() + theme(legend.position = "none")
The MA-plot indicates that proper DE analysis can be conducted, since it seems like the normalization was successfull because the gene distribution is around log2F = 0, which is the assumption that is made. Also, there is a decently amount of significant foldchanges and not too many. There is an outlier up-right in the plot which contributes to the slope of the trend line.
Question 3.2.3: Sort the DE statistics table that you get from DESeq2 to report the top 10 genes sorted by
a) positive logFC (highest on top)
#arrange the tibble top-down, log2FoldChange as a criteria
res_corr_tibble %>% arrange(desc(log2FoldChange)) %>% head(10)
| Gene | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
|---|---|---|---|---|---|---|
| Gene861 | 32093.778217 | 5.832809 | 0.4546667 | 12.278903 | 0.0000000 | 0.0000000 |
| Gene4975 | 8.280601 | 5.598001 | 1.4776725 | 3.619206 | 0.0002955 | 0.0023170 |
| Gene4582 | 1104.622721 | 5.520796 | 0.4297880 | 12.263712 | 0.0000000 | 0.0000000 |
| Gene5510 | 11.872892 | 5.458462 | 1.3509775 | 3.855329 | 0.0001156 | 0.0010823 |
| Gene541 | 24.100684 | 5.127973 | 0.7414828 | 6.578672 | 0.0000000 | 0.0000000 |
| Gene2038 | 65.191615 | 5.119450 | 0.9021305 | 5.397722 | 0.0000001 | 0.0000022 |
| Gene1950 | 3.141275 | 4.844799 | 1.1362548 | 4.043810 | 0.0000526 | 0.0005622 |
| Gene4259 | 18.900008 | 4.816041 | 1.2009566 | 3.802003 | 0.0001435 | 0.0012828 |
| Gene5215 | 467.146864 | 4.573887 | 0.4473346 | 9.665890 | 0.0000000 | 0.0000000 |
| Gene1910 | 869.338646 | 4.548284 | 0.4964889 | 8.657362 | 0.0000000 | 0.0000000 |
b) negative logFC (lowest on top)
#arrange the tibble down to top, log2FoldChange as a criteria
res_corr_tibble %>% arrange(log2FoldChange) %>% head(10)
| Gene | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
|---|---|---|---|---|---|---|
| Gene2502 | 28.830140 | -6.018185 | 0.6684912 | -8.628662 | 0.00e+00 | 0.0000000 |
| Gene5604 | 5.274646 | -5.504877 | 0.9705110 | -5.414547 | 1.00e-07 | 0.0000021 |
| Gene5099 | 6.342716 | -5.449355 | 0.8730538 | -5.955365 | 0.00e+00 | 0.0000001 |
| Gene1613 | 7.846785 | -5.431049 | 0.9441293 | -5.487648 | 0.00e+00 | 0.0000015 |
| Gene3160 | 4.893259 | -5.392966 | 0.9065088 | -5.673376 | 0.00e+00 | 0.0000006 |
| Gene3593 | 7.044358 | -5.271785 | 1.0319123 | -4.866485 | 1.10e-06 | 0.0000238 |
| Gene77 | 3.892940 | -5.056671 | 1.0754914 | -4.469279 | 7.80e-06 | 0.0001232 |
| Gene4597 | 3.065637 | -5.042683 | 1.0635378 | -4.506359 | 6.60e-06 | 0.0001063 |
| Gene1056 | 98.450449 | -4.981815 | 0.4486604 | -10.546540 | 0.00e+00 | 0.0000000 |
| Gene5282 | 3.681568 | -4.973660 | 1.1100428 | -4.255385 | 2.09e-05 | 0.0002643 |
only looking at significantly differentially expressed genes
#arrange the tibble top-down, padj as a criteria
res_corr_tibble %>% arrange(padj) %>% head(10)
| Gene | baseMean | log2FoldChange | lfcSE | stat | pvalue | padj |
|---|---|---|---|---|---|---|
| Gene861 | 32093.77822 | 5.832809 | 0.4546667 | 12.278903 | 0 | 0 |
| Gene4582 | 1104.62272 | 5.520796 | 0.4297880 | 12.263712 | 0 | 0 |
| Gene1056 | 98.45045 | -4.981815 | 0.4486604 | -10.546540 | 0 | 0 |
| Gene5215 | 467.14686 | 4.573887 | 0.4473346 | 9.665890 | 0 | 0 |
| Gene2676 | 78.59401 | 4.353638 | 0.4328948 | 9.479528 | 0 | 0 |
| Gene1684 | 116.26014 | 4.510864 | 0.4586087 | 9.290849 | 0 | 0 |
| Gene423 | 510.03984 | 4.161737 | 0.4305440 | 9.085569 | 0 | 0 |
| Gene3031 | 145.59970 | -3.639055 | 0.3866102 | -8.766079 | 0 | 0 |
| Gene3126 | 37.24829 | -4.725719 | 0.5141028 | -8.705883 | 0 | 0 |
| Gene1910 | 869.33865 | 4.548284 | 0.4964889 | 8.657362 | 0 | 0 |
Question 3.3.1: Novo Nordisk claims their treatment affects expression of genes related to diabetes. Your task is to investigate whether this is true. They have supplied you with a long list of genes that are diabetes-related - diabetesGenes.tsv. Are these genes more up/down regulated than expected by chance, by looking at log2FC values from above ?
Answer:
To visualize the log2FC, p-value and adjusted p-value of these diabetes-related genes i made a volcano-plot.
#filter the diabetes_genes out of the result tibble from the DESeq analysis and plotting the volcano plot
res_corr_tibble %>% filter(Gene %in% diabetes_genes$Gene) %>% ggplot(., aes(x=log2FoldChange,
y=-log10(pvalue), color=padj < 0.05)) +
geom_point(alpha=0.5) +
geom_vline(xintercept = 0, alpha = 0.75,
linetype="dashed")+theme_bw()
It is visible that there are genes which are significantly up/down regulated in regards to the multiple testing adjusted p-value (padj) being < 0.05. Next, I investigated if this treatment affects specifically diabetes-related genes or affect non-diabetes-related genes the same way.